Netflix user behaviour

Requirements

Jupyter Notebook
Apache Toree
sampleDataNetflix.tsv placed in local filesystem and path updated in 1) below

Notes

  • I used a combination of Jupyter notebook and the Apache Toree project as it makes it easy and fast to explore a dataset.
  • I was part of the team that came up with Apache Toree (aka The Spark Kernel), and till now I think it's still the only Jupyter kernel that ties to a Spark Session and is backed by Apache. It solved many issues for us back when we were developing applications in Spark.

Future

  • I was hoping to use Voila project to create an interactive dashboard for data scientists where they could move a slider widget to change the parameters in my SQL queries, thus, change the time window to search. So, for example, say a data scientist would want to search for users only between 8 and 9 in the morning.
  • I wanted to randomly generate a bigger dataset using rules so that we could at least have more data to play with

1. Let's read our data

We will read in a TSV file and try to infer schema since it is not very complex data types we are using


In [104]:
val sessions = spark.read.option("header", "true")
                         .option("sep", "\t")
                         .option("inferSchema","true")
                         .csv("/Users/memo/Desktop/netflixSpark/sampleDataNetflix.tsv")


sessions = [user_id: int, navigation_page: string ... 5 more fields]
lastException: Throwable = null
Out[104]:
[user_id: int, navigation_page: string ... 5 more fields]

In [11]:
sessions.printSchema


root
 |-- user_id: integer (nullable = true)
 |-- navigation_page: string (nullable = true)
 |-- url: string (nullable = true)
 |-- session_id: integer (nullable = true)
 |-- date: integer (nullable = true)
 |-- hour: integer (nullable = true)
 |-- timestamp: integer (nullable = true)


In [12]:
sessions.show(2)


+-------+---------------+--------------------+----------+--------+----+----------+
|user_id|navigation_page|                 url|session_id|    date|hour| timestamp|
+-------+---------------+--------------------+----------+--------+----+----------+
|   1001|       HomePage|https://www.netfl...|      6001|20181125|  11|1543145019|
|   1001| OriginalsGenre|https://www.netfl...|      6001|20181125|  11|1543144483|
+-------+---------------+--------------------+----------+--------+----+----------+
only showing top 2 rows

2. Let's create a temp SQL table to use of the SQL magic in Apache Toree to get our information


In [13]:
sessions.registerTempTable("SESSIONS")


warning: there was one deprecation warning; re-run with -deprecation for details

a) Find all users who have visited OurPlanetTitle Page.

Using DISTINCT to show unique users


In [16]:
%%SQL select distinct user_id 
from SESSIONS 
where navigation_page = 'OurPlanetTitle'


Out[16]:
+-------+
|user_id|
+-------+
|   1001|
|   2002|
|   2002|
|   4001|
|   3003|
+-------+

b) Find all users who have visited OurPlanetTitle Page only once.

Showing the page visits just for validation, can be easily removed from the projection list in query


In [25]:
%%SQL select user_id, count(user_id) as page_visits 
from SESSIONS 
where navigation_page = 'OurPlanetTitle' 
group by user_id
having page_visits == 1


Out[25]:
+-------+-----------+
|user_id|page_visits|
+-------+-----------+
|   4001|          1|
|   3003|          1|
|   1001|          1|
+-------+-----------+

c) Find all users who have visited HomePage -> OriginalsGenre -> OurPlanetTitle -> HomePage

Making sure we filter for the same path using the timestamps and making sure it's all within the same session_id


In [101]:
%%SQL select distinct a.user_id
from sessions a,
sessions b,
sessions c,
sessions d
where a.user_id = b.user_id
and b.user_id = c.user_id
and c.user_id = d.user_id
and a.navigation_page = 'HomePage'
and b.navigation_page = 'OriginalsGenre'
and c.navigation_page = 'OurPlanetTitle'
and d.navigation_page = 'HomePage'
and a.timestamp < b.timestamp
and b.timestamp < c.timestamp
and c.timestamp < d.timestamp
and a.session_id = b.session_id
and b.session_id = c.session_id
and c.session_id = d.session_id


Out[101]:
+-------+
|user_id|
+-------+
|   2002|
+-------+

d) Find all users who landed on LogIn Page from a Title Page

The like operator is not the most performant but the SQL optimizer should be able to tell that my 2nd where clause can improve selectivity of this query. I am using the timestamp column to make sure that a before landing on a Login page, the user first comes from a Title page


In [98]:
%%SQL select a.user_id
from sessions a,
sessions b
where a.user_id = b.user_id
and b.navigation_page = 'LogIn'
and a.navigation_page like '%Title'
and a.timestamp < b.timestamp


Out[98]:
+-------+
|user_id|
+-------+
|   3003|
+-------+

e) Find all users who have visited only OurPlanetTitle Page

We are using relation 'b' to get the total count of url the user has visited


In [82]:
%%SQL select a.user_id
from sessions a,
(select user_id, count(url) as totalUrl from sessions group by user_id) b
where a.user_id = b.user_id
and a.navigation_page = 'OurPlanetTitle'
and b.totalurl = 1


Out[82]:
+-------+
|user_id|
+-------+
|   4001|
+-------+